algod importer: Update sync on WaitForBlock error. #122
Conversation
Codecov Report
@@ Coverage Diff @@
## master #122 +/- ##
==========================================
+ Coverage 67.66% 70.37% +2.71%
==========================================
Files 32 36 +4
Lines 1976 2535 +559
==========================================
+ Hits 1337 1784 +447
- Misses 570 654 +84
- Partials 69 97 +28
... and 1 file with indirect coverage changes
Looks correct to me. I'm not sure I follow how this causes the pipeline to hang though.
Last I checked, if you stop/start the node it will have the last MaxAcctLookback deltas in cache (and even more rounds available). It will also run ahead MaxAcctLookback-1 rounds.
So unless that number is 1, the node/pipeline should make progress despite the sync round being 1 round lower than what we expect. And the pipeline would correctly update the sync round once it processed another round.
I don't totally understand it either. I'm guessing there is some sort of cooldown / warmup time when rounds are being processed very quickly. For the file processor each round is being processed in the 50-200µs range. I was able to confirm that the sync round needs to be set again (this is with MaxAcctLookback = 64):
👍
This looks correct to me.
I made suggestions about rewording some comments, possibly using errors.Join, and keeping a higher timeout value.
The following thought strikes me. During yesterday's standup it sounded like around 1 in 3 shutdowns of algod during catchup would result in this bug. So if one brings algod down and up enough times we could practically guarantee the bug. It may even be possible to simulate this issue reliably in our short-duration E2E tests.
const (
	retries = 5

var (
	waitForRoundTimeout = 5 * time.Second
Nit:
A more conservative timeout would be 45 seconds. I agree that we want to give conduit more determinism about the outcome of each call to the waitForBlock endpoint, so it's a good idea to make the call time out on its own terms rather than the endpoint's, as this PR does. On the other hand, we might still want the ability to keep 10 threads of the algod importer running concurrently after we've all caught up, and a 45 sec timeout would allow for that. If we narrow the timeout to 5 secs, we essentially only allow one or two algod importer threads to run at a time (probably only one, due to round time variability).
On the other hand, we can change the value as aggressively as in the PR, and if the need arises in the future to raise it back to 45 secs we can do it.
The low timeout was intended for responsiveness, basically when the node is stalled the timeout needs to elapse before the first recovery attempt. If there's a timeout I'm expecting the pipeline to retry the call.
The default retry count is 5, now I'm wondering if it should be unlimited.
The old Indexer had a package called fetcher, I wonder if we should bring that back to manage more optimal round caching: https://github.com/algorand/indexer/blob/master/fetcher/fetcher.go#L1
A worthwhile thought for a future PR or even the pipelining effort. (suggest keeping this thread unresolved for future reference)
I'll change the default retry timeout to 0 in a followup PR, it's probably a good default anyway since people have expressed appreciation for Indexer working that way.
Approving, even though I'm still curious if creating an E2E test is viable. That can be left as a future exercise.
Summary
If algod is restarted after it receives a sync round update but before it fetches the new round(s), then the algod follower and conduit will stall. Conduit will keep waiting for algod to reach the new sync round but it never happens.
This change adds some extra logic to the WaitForBlock call: if there is a timeout or a bad response, a new attempt to set the sync round is made.
This PR also removes the retry loop from the algod importer. Retry is now managed by the pipeline.
Test Plan
Update existing unit tests.